-
Notifications
You must be signed in to change notification settings - Fork 70
Add test coverage for JobSet Workflow Orchestration #513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughAdds test helpers to find Jobs and retrieve a single JobSet, extends the test client with a JobSet client accessor, and adds two end-to-end tests validating successful and failing JobSet initializer workflows using TrainingRuntime, TrainJob, and a shared PVC. Changes
Sequence Diagram(s)sequenceDiagram
participant Test
participant K8s as Kubernetes API
participant Client as Test Client
participant JobSetCtrl as JobSet Controller
participant Jobs as Replicated Jobs
rect rgb(240,248,255)
Note over Test,K8s: Setup resources
Test->>K8s: create Namespace & PVC
Test->>K8s: create TrainingRuntime (contains JobSet spec)
end
rect rgb(245,255,240)
Note over Test,JobSetCtrl: Execution
Test->>K8s: create TrainJob referencing TrainingRuntime
JobSetCtrl->>Jobs: instantiate replicatedJobs (dataset/model/node)
Jobs-->>JobSetCtrl: report status (succeeded / failed)
end
rect rgb(255,250,240)
Note over Test,Client: Verification & cleanup
Test->>Client: JobSet().List / GetSingleJobSet(namespace)
Test->>Client: GetJobByNamePattern(namespace, pattern)
Test->>K8s: assert TrainJob status (succeeded / failed)
Test->>K8s: delete TrainingRuntime
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/common/support/jobs.go (1)
26-41: Tighten not‑found semantics and avoid pointer-to-loop-variable confusionTwo small improvements you might consider:
- Returning
nil, nilfor “not found” is a bit ambiguous; a signature like(*batchv1.Job, bool, error)(or documenting the nil,nil case clearly) would make call sites more robust.- For clarity, prefer taking the address of the slice element instead of the loop variable, e.g.:
-func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) { +func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) { test.T().Helper() jobs, err := test.Client().Core().BatchV1().Jobs(namespace).List(test.Ctx(), metav1.ListOptions{}) if err != nil { return nil, err } - for _, job := range jobs.Items { - if strings.Contains(job.Name, pattern) { - return &job, nil - } - } + for i := range jobs.Items { + if strings.Contains(jobs.Items[i].Name, pattern) { + return &jobs.Items[i], nil + } + } return nil, nil }This avoids the common “pointer to range variable” gotcha and makes the intent explicit.
tests/trainer/jobset_workflow_test.go (1)
131-357: TrainingRuntime / JobSet spec is consistent; consider small extraction to reduce duplicationThe
createTrainingRuntimeWithInitializershelper wires threeReplicatedJobs with the expected dependencies and sharedworkspacePVC, and the resource requests/limits look reasonable for tests. The structure is clear and matches the intended workflow.If you ever find this growing or needing reuse, you could optionally extract the common
Volume/VolumeMountandBackoffLimit/RestartPolicyNeverpieces into small local helpers to reduce repetition, but it’s fine as-is for readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/client.go (1)
Client(39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(38-49)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(39-50)
🔇 Additional comments (4)
tests/common/support/jobset.go (1)
27-48: JobSet helper is clear and idiomatic for test usageThe
jobsetOperatorGVRdefinition andGetSingleJobSethelper are straightforward and fit well with the existingTest/Gomega patterns. Asserting a single JobSet viaHaveLen(1)before returning the first item is reasonable in this test-only context.tests/trainer/jobset_workflow_test.go (3)
80-129: Failed workflow test logic looks solid and exercises JobSet status conditionsThe
TestFailedJobSetWorkflowpath correctly:
- Waits for the JobSet to exist via
GetSingleJobSet.- Polls
status.conditionswithunstructured.NestedSliceand explicitly errors when the field is missing.- Locates a
"Failed"condition withstatus == "True", captures the message, and asserts the TrainJob eventually transitions toTrainJobFailed.This gives good coverage of the failure behavior and ensures the TrainJob reflects JobSet failure as expected.
359-519: TrainJob helpers align well with initializer semanticsBoth
createTrainJobWithInitializersandcreateTrainJobWithFailingInitializer:
- Correctly reference the
TrainingRuntimeviaRuntimeRef.- Configure dataset/model initializer env vars that match what the initializer Jobs expect.
- Use clear shell commands to validate dataset/model visibility (success case) or ensure the trainer won’t run if the initializer fails (failure case).
These helpers nicely encapsulate the TrainJob setup for the two scenarios and keep the tests themselves concise.
521-534: Runtime cleanup helper is safe and test-friendly
deleteTrainingRuntimewraps the delete call and logs a warning instead of failing the test on cleanup errors, which is a good pattern for integration tests where teardown shouldn’t mask primary failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/common/support/jobset.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/jobset_workflow_test.go (10)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/test_tag.go (1)
Tags(32-40)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(38-49)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(39-50)
🔇 Additional comments (1)
tests/trainer/jobset_workflow_test.go (1)
56-74: LGTM: Eventually block correctly handles missing field.The code properly handles the case when
spec.replicatedJobsis not found by returning an error (lines 67-68), ensuring thatEventuallykeeps polling until the field exists. This addresses the concern raised in the previous review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/client.go (1)
Client(39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(38-49)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(39-50)
🔇 Additional comments (5)
tests/common/support/jobset.go (1)
27-49: LGTM! Clean test helper for JobSet retrieval.The implementation correctly uses the dynamic client with the JobSet v1alpha2 API. Error handling is appropriate: the function returns errors from the List operation and uses a gomega assertion to verify exactly one JobSet exists, which is the expected behavior for a test helper.
tests/common/support/jobs.go (1)
26-42: LGTM! Loop variable issue correctly resolved.The past critical issue regarding taking the address of a loop variable has been properly fixed. The code now iterates by index and returns
&jobs.Items[i], avoiding the pointer-to-reused-variable bug.One minor observation: the function returns
(nil, nil)when no job matches the pattern. While this may be intentional for your test usage, consider documenting this behavior in a comment if it's not immediately obvious to future readers.tests/trainer/jobset_workflow_test.go (3)
56-74: Well done fixing theNestedSlicepolling logic.The past issue where
Eventuallycould succeed whenspec.replicatedJobswas missing has been correctly addressed. The code now properly separates the error check from thefoundcheck (lines 64-69), ensuring the polling loop continues until the field actually exists.
102-143: Excellent error handling improvements in the failure detection logic.The past issues with unchecked type assertions and ignored
NestedStringreturn values have been thoroughly addressed:
- Line 118: Type assertion now uses the comma-ok form with proper fallback
- Lines 122-125, 127-130, 132-138: All
NestedStringcalls now check bothfoundanderrbefore using valuesThe current implementation is robust and handles missing or malformed fields gracefully.
151-554: Well-designed test helpers and workflow structure.The helper functions create a realistic multi-stage JobSet workflow:
- Dataset initializer → Model initializer → Trainer node, with proper
DependsOnchains (lines 237-241, 307-311)BackoffLimit: 0throughout prevents retry noise in test runs- Shell scripts include appropriate validation (e.g.,
FAIL_ON_PURPOSEcheck at lines 190-193)- Cleanup helper logs warnings rather than failing tests (lines 549-553)
The test coverage effectively validates both success and failure paths for the JobSet orchestration.
|
@ChughShilpa These JobSet based sanity tests looks awesome 🚀 |
| Model: &trainerv1alpha1.ModelInitializer{ | ||
| Env: []corev1.EnvVar{ | ||
| { | ||
| Name: "MODEL_NAME", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick - Model name is not needed as model is hardcoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/trainer/jobset_workflow_test.go (2)
174-179: Consider adding resource limits to initializers.Based on a previous review comment, resource limits should be added to the initializer containers for consistency with the node container (lines 298-307) and to prevent unbounded resource consumption.
Apply this diff to add limits to the dataset-initializer:
Resources: corev1.ResourceRequirements{ Requests: corev1.ResourceList{ corev1.ResourceCPU: resource.MustParse("100m"), corev1.ResourceMemory: resource.MustParse("128Mi"), }, + Limits: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("200m"), + corev1.ResourceMemory: resource.MustParse("256Mi"), + }, },Apply a similar change to the model-initializer resources at lines 244-249.
Based on past review comments.
Also applies to: 244-249
376-378: Optional: MODEL_NAME env var is not functionally used.As noted in a previous review comment, the
MODEL_NAMEenvironment variable is only used for logging at line 234, while the actual copy operation at line 238 hardcodes the path/models/bloom-560m. You could either remove the env var or update the script to use it.Example to make the script use the variable:
echo "Model: ${MODEL_NAME}" echo "Target path: ${MODEL_PATH}" echo "Copying Bloom-560M model to shared volume ..." mkdir -p ${MODEL_PATH} -cp -r /models/bloom-560m ${MODEL_PATH}/ +cp -r /models/${MODEL_NAME} ${MODEL_PATH}/ ls -la ${MODEL_PATH}/Based on past review comments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
tests/common/support/client.go(6 hunks)tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/common/support/jobset.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(40-52)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(27-38)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(40-52)
🔇 Additional comments (5)
tests/common/support/client.go (1)
30-30: LGTM! Clean JobSet client integration.The JobSet client is properly integrated following the same pattern as other clients in the test framework. The implementation includes proper error handling and consistent naming conventions.
Also applies to: 44-44, 58-58, 82-84, 141-144, 185-185
tests/common/support/jobs.go (1)
26-42: LGTM! Helper correctly implemented.The function properly handles the loop variable pointer issue (previously flagged) by iterating with an index and returning a pointer to the slice element. The error handling and return patterns are appropriate.
tests/trainer/jobset_workflow_test.go (3)
37-73: LGTM! Well-structured end-to-end test.The happy-path test properly validates the JobSet workflow with initializers. The use of typed access (via
GetSingleJobSet) and proper cleanup withdeferstatements are good practices.
75-118: LGTM! Failure path test correctly validates error propagation.The test properly verifies that initializer failures propagate to the JobSet and TrainJob. The use of typed conditions is correct and avoids the unstructured field access issues that were previously flagged.
348-443: LGTM! Test helpers are well-implemented.The helper functions follow consistent patterns:
- Proper use of
test.T().Helper()for stack traces- Appropriate error handling and logging
- The failing initializer correctly uses
FAIL_ON_PURPOSE=trueto trigger the failure path in the runtime scriptAlso applies to: 445-508, 510-523
| defer deleteTrainJob(test, namespace, trainJob.Name) | ||
|
|
||
| // Verify JobSet is created with 3 replicated jobs | ||
| test.Eventually(func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you use the JobSet client then I think this check can be made simpler 1-2 liner, using WithTransform to get replicatedJobs
| } | ||
| return nil | ||
| }, TestTimeoutShort).Should(Succeed()) | ||
| test.T().Log("JobSet is created with 3 replicated jobs (dataset-initializer, model-initializer, node)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add here assertions verifying that dataset-initializer job is created (as the only job for now) and finished, after that model-initializer is created and finished and then node job is created and finished?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be needed to introduce some explicit sleep commands (a few seconds) in jobs to make sure that the assertions manage to verify it before job finishes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/trainer/jobset_workflow_test.go (1)
37-73: Good test coverage for the success path.The test properly validates the JobSet workflow with initializers, including namespace setup, resource creation, and verification of successful completion.
However, there are two optional improvements you could consider:
- Simplify the JobSet verification (lines 56-66): As sutaakar suggested, you could use
WithTransformto make this check more concise:test.Eventually(func(g Gomega) { jobset, err := GetSingleJobSet(test, namespace) g.Expect(err).NotTo(HaveOccurred()) g.Expect(jobset.Spec.ReplicatedJobs).To(HaveLen(3)) }, TestTimeoutShort).Should(Succeed())
- Add sequential execution verification (per sutaakar's comment at line 67): You could add assertions to verify that jobs execute in the expected order (dataset-initializer → model-initializer → node), though this would require polling individual Job statuses and might need sleep commands in the jobs to make timing predictable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/trainer/jobset_workflow_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(27-38)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(40-52)
🔇 Additional comments (5)
tests/trainer/jobset_workflow_test.go (5)
75-118: Excellent failure path coverage.The test properly validates that failures in initializer jobs propagate correctly through the JobSet to the TrainJob status. The condition checking logic is robust and handles missing fields appropriately.
120-354: Well-structured TrainingRuntime with proper resource management.The helper creates a comprehensive TrainingRuntime with three replicated jobs in the correct dependency order. All containers have both resource requests and limits specified, which is essential for proper scheduling and resource management in Kubernetes.
The shell scripts are clear and handle error conditions appropriately (e.g., the FAIL_ON_PURPOSE flag at lines 159-162).
356-451: TrainJob helper correctly implements the initializer pattern.The function properly configures both dataset and model initializers with environment variables, and the trainer command validates that the initialized assets are accessible.
Regarding sutaakar's comment about MODEL_NAME at line 385: While the image is hardcoded to
bloom-560m, the MODEL_NAME environment variable is functionally necessary because the model-initializer script (line 242 in the TrainingRuntime) uses${MODEL_NAME}to locate the model path:cp -r /models/${MODEL_NAME} ${MODEL_PATH}/.This design follows the initializer pattern where the TrainingRuntime provides generic, reusable scripts and the TrainJob supplies the configuration parameters. It makes the runtime template more flexible for future use cases.
453-516: Failure test helper correctly configured.The function properly sets up a TrainJob that will fail by injecting the
FAIL_ON_PURPOSE=trueenvironment variable, which triggers the error condition in the dataset-initializer script. The minimal trainer command is appropriate since the job should fail before reaching that stage.
518-531: LGTM!The cleanup helper follows best practices with proper use of
test.T().Helper()and appropriate logging for both success and failure cases. Using log warnings rather than assertions for cleanup failures is correct for deferred cleanup operations.
Add test coverage for JobSet Workflow Orchestration
Description
How Has This Been Tested?
go test ./tests/trainer/ -run TestJobSetWorkflow -vMerge criteria:
Summary by CodeRabbit
Tests
Chores